Lab 3: Modeling correlation and regression

Practice session covering topics discussed in Lecture 3

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

The examples and code from this lab session follow very closely the book:

Lecture 3: topics

  • Testing for a correlation hypothesis (relationship of variables)
    • Pearson rho analysis (parametric)
    • Spearman test (non-parametric)
  • Measures of association
    • Fisher’s Exact Test
    • Chi-Square Test of Independence
  • From correlation/association to causation
    • introduction to experiments
      • Example: Linear regression models
      • Example: Multiple Linear Regression
  • From causation to prediction
    • introduction to Machine Learning
      • Supervised algorithms
      • Unsupervised algorithms

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# General 
library(fs)      # file/directory interactions
library(here)    # tools to find your project's files, based on the working directory
library(janitor) # tools for examining and cleaning data
library(dplyr)   # {tidyverse} tools for manipulating and summarizing tidy data 
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting

# Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest) # Testing Linear Regression Models
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(tidymodels) # not installed on this machine
# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggpubr) # 'ggplot2' Based Publication Ready Plots

# Data 
# devtools::install_github("OI-Biostat/oi_biostat_data")
#library(oibiostat) 
# load some colors
colors <- readRDS(here::here("practice", "data_input", "03_datasets","colors.rds"))

DATASETS for today

Importing Dataset 1 (NHANES)

  • Adapt the here() call below to match your own folder structure

Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999. Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
nhanes_samp <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "nhanes_samp.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- NHANES Variables and their description

Variable Type Description
X int xxxx
ID int xxxxx
SurveyYr chr yyyy_mm. Ex. 2011_12
Gender chr Gender (sex) of study participant coded as male or female
Age int ##
AgeDecade chr yy-yy es 20-29
Race1 chr Reported race of study participant: Mexican, Hispanic, White, Black, or Other
Race3 chr Reported race of study participant... Not availale for 2009-10
Education chr [>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
MaritalStatus chr [>= 20 yro]. Ex. Married, Widowed, Divorced, Separated, NeverMarried, or LivePartner
HHIncome chr Total annual gross income for the household in US dollars
HHIncomeMid int Numerical version of HHIncome derived from the middle income in each category. Ex. 12500 40000
Poverty dbl A ratio of family income to poverty guidelines. Smaller numbers indicate more poverty. Ex. 0.95 1.74 4.99
HomeRooms int How many rooms are in home of study participant (counting kitchen but not bath room).
HomeOwn chr One of Home, Rent, or Other
Work chr One of NotWorking or Working
Weight dbl Weight in kg
Height dbl Standing height in cm. Reported for participants aged 2 years or older.
BMI dbl Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse int 60 second pulse rate
BPSysAve int Combined systolic blood pressure reading, following the procedure outlined for BPXSAR
BPDiaAve int Combined diastolic blood pressure reading, following the procedure outlined for BPXDAR
BPSys1 int Systolic blood pressure in mm Hg – first reading
BPDia1 int Diastolic blood pressure in mm Hg – first reading
BPSys2 int Systolic blood pressure in mm Hg – second reading (consecutive readings)
BPDia2 int Diastolic blood pressure in mm Hg – second reading (consecutive readings)
BPSys3 int Systolic blood pressure in mm Hg – third reading (consecutive readings)
BPDia3 int Diastolic blood pressure in mm Hg – third reading (consecutive readings)
Testosterone dbl Testosterone total (ng/dL). Reported for participants aged 6 years or older. Not available for 2009-2010
DirectChol dbl Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol dbl Total cholesterol in mmol/L. Reported for participants aged 6 years or older
UrineVol1 int Urine volume in mL – first test. Reported for participants aged 6 years or older
UrineFlow1 dbl Urine flow rate in mL/min – first test. Reported for participants aged 6 years or older
UrineVol2 int Urine volume in mL – second test
UrineFlow2 dbl Urine flow rate (urine volume/time since last urination) in mL/min – second test
Diabetes chr Study participant told by a doctor or health professional that they have diabetes
DiabetesAge int Age of study participant when first told they had diabetes
HealthGen chr Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor
DaysPhysHlthBad int Self-reported # of days participant’s physical health was not good out of the past 30 days
DaysMentHlthBad int Self-reported # of days participant’s mental health was not good out of the past 30 days
LittleInterest chr Self-reported # of days where participant had little interest in doing things. Among: None, Several, Majority, or AlmostAll
Depressed chr Self-reported # of days where participant felt down, depressed or hopeless. Among: None, Several, Majority, or AlmostAll
nPregnancies int # times participant has been pregnant
nBabies int # deliveries resulted in live births
PregnantNow chr Pregnancy status ascertained for females 8-59 years of age
Age1stBaby int Age of participant at time of first live birth
SleepHrsNight int Self-reported # of hours study participant gets at night on weekdays or workdays. For participants aged 16 years and older
SleepTrouble chr Participant [16 years and older] has had trouble sleeping. Coded as Yes or No.
PhysActive chr Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
PhysActiveDays int Number of days in a typical week that participant does moderate or vigorous intensity activity.
TVHrsDay chr Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDay chr Number of hours per day on average participant used a computer or gaming device over the past 30 day
TVHrsDayChild lgl [2-11 yro] Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDayChild lgl [2-11 yro] Number of hours per day on average participant used a computer or gaming device over the past 30 day
Alcohol12PlusYr chr Participant has consumed at least 12 drinks of any type of alcoholic beverage in any one year
AlcoholDay int Average number of drinks consumed on days that participant drank alcoholic beverages
AlcoholYear int [>= 18 yro] Estimated number of days over the past year that participant drank alcoholic beverages
SmokeNow chr Study participant currently smokes cigarettes regularly. (Yes or No)
Smoke100 chr Study participant has smoked at least 100 cigarettes in their entire life. (Yes or No)
Smoke100n chr Smoke100 recoded as Smoker or Non-Smoker
SmokeAge int Age study participant first started to smoke cigarettes fairly regularly
Marijuana chr Participant has tried marijuana (Yes or No)
AgeFirstMarij int Age at which participant first tried marijuana
RegularMarij chr Participant has been/is a regular marijuana user (used at least once a month for a year) (Yes or No)
AgeRegMarij int Age of participant when first started regularly using marijuana
HardDrugs chr Participant has tried cocaine, crack cocaine, heroin or methamphetamine (Yes or No)
SexEver chr Participant has ever had sex (Yes or No)
SexAge int Age at which participant first had sex
SexNumPartnLife int Number of opposite-sex partners participant has had
SexNumPartYear int Number of opposite-sex partners over the past 12 months
SameSex chr Participant has had any kind of sex with a same-sex partner (Yes or No)
SexOrientation chr Participant’s sexual orientation: one of Heterosexual, Homosexual, Bisexual

Importing Dataset 2 (PREVEND)

Data from the Prevention of Renal and Vascular End-stage Disease (PREVEND) study, which took place in the Netherlands. The study collected various demographic and cardiovascular risk factors. This dataset is from the third survey, which participants completed in 2003-2006; data is provided for 4,095 individuals who completed cognitive testing.

Name: PREVEND (Prevention of REnal and Vascular END-stage Disease) is a study that took place in the Netherlands, enrolling 8,592 participants aged 28-75 in 1997-1998. A second survey was conducted in 2001-2003 on 6,894 participants, and 5,862 completed the third survey in 2003-2006 (when measurement of cognitive function was added to the study protocol).
Documentation: dataset2
Sampling details: Data from 4,095 individuals who completed cognitive testing are in the prevend dataset, available in the R package oibiostat.

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
prevend <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "prevend.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- PREVEND Variables and their description

Variable Description
... ...

Importing Dataset 3 (FAMuSS)

Name: FAMuSS (Functional SNPs Associated with Muscle Size and Strength) examined the association of demographic, physiological and genetic characteristics with muscle strength, including data on race and genotype at a specific locus on the ACTN3 gene (the “sports gene”).
Documentation: dataset3
Sampling details: the dataset includes 595 observations on 9 variables

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
famuss <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "famuss.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- FAMuSS Variables and their description

Variable Description
X id
ndrm.ch Percent change in strength in the non-dominant arm
drm.ch Percent change in strength in the dominant arm
sex Sex of the participant
age Age in years
race Recorded as African Am (African American), Caucasian, Asian, Hispanic, Other
height Height in inches
weight Weight in pounds
actn3.r577x Genotype at the location r577x in the ACTN3 gene.
bmi Body Mass Index

CORRELATION

Explore relationships between two variables

Approaches for summarizing relationships between two variables vary depending on variable types…

  • Two numerical variables
  • Two categorical variables
  • One numerical variable and one categorical variable

Two variables \(x\) and \(y\) are

  • positively associated if \(y\) increases as \(x\) increases.
  • negatively associated if \(y\) decreases as \(x\) increases.

TWO NUMERICAL VARIABLES (NHANES)

Two numerical variables (plot)

Height and weight (taken from the nhanes_samp dataset) are positively associated.

  • notice we can also use the generic function base::plot for a simple scatter plot
# rename for convenience
nhanes <- nhanes_samp %>% 
  janitor::clean_names()

# basic plot 
plot(nhanes$height, nhanes$weight,
     xlab = "Height (cm)", ylab = "Weight (kg)", cex = 0.8)  

Two numerical variables (plot)

Two numerical variables: correlation (with stats::cor)

Correlation is a numerical summary that measures the strength of a linear relationship between two variables.

  • The correlation coefficient \(r\) takes on values between \(-1\) and \(1\).

  • The closer \(r\) is to \(\pm 1\), the stronger the linear association.

  • Here we compute the Pearson rho (parametric), with the function cor

    • notice the use argument to choose how to deal with missing values (in this case only using all complete pairs)
is.numeric(nhanes$height) 
[1] TRUE
is.numeric(nhanes$weight)
[1] TRUE
# using `stats` package
cor(x = nhanes$height, y =  nhanes$weight, 
    # argument for dealing with missing values
    use = "pairwise.complete.obs",
    method = "pearson")
[1] 0.4102269

Two numerical variables: correlation (or with stats::cor.test)

  • Here we compute the Pearson rho (parametric) with the function cor.test (the same we used for testing paired samples)
    • it implicitly takes care of NAs
# using `stats` package 
cor_test_result <- cor.test(x = nhanes$height, y =  nhanes$weight, 
                            method = "pearson")

# looking at the cor estimate
cor_test_result[["estimate"]][["cor"]]
[1] 0.4102269
  • The function ggpubr::ggscatter gives us everything in one (scatter plot + \(r\), displayed as “R”)
library("ggpubr") # 'ggplot2' Based Publication Ready Plots
ggpubr::ggscatter(nhanes, x = "height", y = "weight", 
                  cor.coef = TRUE, cor.method = "pearson", #cor.coef.coord = 2,
                  xlab = "Height (cm)", ylab = "Weight (kg)")

Two numerical variables: correlation (or with stats::cor.test)

Spearman rank-order correlation

Spearman’s rank-order correlation is the nonparametric version of the Pearson correlation.

Spearman’s correlation coefficient (\(\rho\), also written \(r_s\)) measures the strength and direction of association between two ranked variables.

  • used when the relationship between 2 variables is monotonic but not necessarily linear
  • excellent for ordinal data (when Pearson’s is not appropriate), e.g. Likert scale items

To compute it, we simply calculate Pearson’s correlation on the rankings of the raw data (instead of the raw values themselves).
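This equivalence is easy to verify directly; the sketch below uses small simulated vectors (hypothetical data, not the NHANES sample) and only base R:

```r
# Spearman's rho equals Pearson's r computed on the ranks
# (illustrated on simulated, hypothetical data)
set.seed(42)
x <- rnorm(50)
y <- x^3 + rnorm(50)   # monotone but non-linear relationship

rho_spearman <- cor(x, y, method = "spearman")
rho_on_ranks <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_spearman, rho_on_ranks)  # TRUE
```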

Spearman rank-order correlation (example)

Let’s say we want to get Spearman’s correlation between the ordinal factors Education and HealthGen in the NHANES sample. We first have to convert them to their underlying numeric codes, to compare rankings.

tabyl(nhanes$education)
 nhanes$education   n percent valid_percent
        8th Grade  32   0.064    0.06412826
   9 - 11th Grade  68   0.136    0.13627255
     College Grad 157   0.314    0.31462926
      High School  94   0.188    0.18837675
     Some College 148   0.296    0.29659319
             <NA>   1   0.002            NA
tabyl(nhanes$health_gen)
 nhanes$health_gen   n percent valid_percent
         Excellent  47   0.094    0.10444444
              Fair  53   0.106    0.11777778
              Good 177   0.354    0.39333333
              Poor  11   0.022    0.02444444
             Vgood 162   0.324    0.36000000
              <NA>  50   0.100            NA
nhanes <- nhanes %>% 
  # reorder education levels
  mutate(edu_ord = factor(education, 
                          levels = c("8th Grade", "9 - 11th Grade",
                                     "High School", "Some College",
                                     "College Grad"))) %>%  
  # create edu_rank 
  mutate(edu_rank = as.numeric(edu_ord)) %>% 
  # reorder general health levels
  mutate(health_ord = factor(health_gen, 
                             levels = c("Poor", "Fair", "Good",
                                        "Vgood", "Excellent"))) %>%
  # create health_rank 
  mutate(health_rank = as.numeric(health_ord)) 

table(nhanes$edu_ord, useNA = "ifany" )

     8th Grade 9 - 11th Grade    High School   Some College   College Grad 
            32             68             94            148            157 
          <NA> 
             1 
table(nhanes$edu_rank, useNA = "ifany" )

   1    2    3    4    5 <NA> 
  32   68   94  148  157    1 
table(nhanes$health_ord, useNA = "ifany" )

     Poor      Fair      Good     Vgood Excellent      <NA> 
       11        53       177       162        47        50 
table(nhanes$health_rank,  useNA = "ifany" )

   1    2    3    4    5 <NA> 
  11   53  177  162   47   50 

Spearman rank-order correlation (example cont.)

Now we can actually compute it

  • After setting up the variables correctly (as.numeric), just specify method = "spearman"
# using `stats` package 
cor_test_result_sp <- cor.test(x = nhanes$edu_rank,
                               y = nhanes$health_rank, 
                               method = "spearman", 
                               exact = FALSE) # removes the Ties message warning 
# looking at the cor estimate
cor_test_result_sp

    Spearman's rank correlation rho

data:  nhanes$edu_rank and nhanes$health_rank
S = 10641203, p-value = 1.915e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2946493 
#cor_test_result_sp[["estimate"]][["rho"]]

TWO CATEGORICAL VARIABLES (FAMuSS)

Two categorical variables (plot)

In the famuss dataset, the variables race, and actn3.r577x are categorical variables.

## genotypes as columns
genotype.race = matrix(table(famuss$actn3.r577x, famuss$race), ncol = 3, byrow = TRUE)
colnames(genotype.race) = c("CC", "CT", "TT")
rownames(genotype.race) = c("African Am", "Asian", "Caucasian", "Hispanic", "Other")

barplot(genotype.race, col = colors[c(7, 4, 1, 2, 3)], ylim=c(0,300), width=2)
legend("topright", inset=c(.05, 0), fill=colors[c(7, 4, 1, 2, 3)], 
       legend=rownames(genotype.race))

Two categorical variables (contingency table)

Specifically, the variable actn3.r577x takes on three possible levels (CC, CT, or TT), indicating the genotype at location r577x on the ACTN3 gene for the FAMuSS study participants.

A contingency table summarizes data for two categorical variables.

  • the function stats::addmargins puts arbitrary Margins on Multidimensional Tables
    • The extra column & row "Sum" provide the marginal totals across each row and each column, respectively
# levels of actn3.r577x
table(famuss$actn3.r577x)

 CC  CT  TT 
173 261 161 
# contingency table to summarize race and actn3.r577x
addmargins(table(famuss$race, famuss$actn3.r577x))
            
              CC  CT  TT Sum
  African Am  16   6   5  27
  Asian       21  18  16  55
  Caucasian  125 216 126 467
  Hispanic     4  10   9  23
  Other        7  11   5  23
  Sum        173 261 161 595

Two categorical variables (contingency table prop)

Contingency tables can also be converted to show proportions. Since there are 2 variables, it is necessary to specify whether the proportions are calculated according to the row variable or the column variable:

  • using the margin = argument in the base::prop.table function (1 indicates rows, 2 indicates columns)
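A minimal sketch of the margin argument on a small hypothetical matrix (not the famuss data):

```r
# Effect of the `margin` argument in prop.table
# (small hypothetical 2x3 matrix for illustration)
tab <- matrix(c(10, 20, 30,
                40, 50, 60), nrow = 2, byrow = TRUE)

row_props <- prop.table(tab, margin = 1)  # proportions within each row
col_props <- prop.table(tab, margin = 2)  # proportions within each column

rowSums(row_props)  # each row sums to 1
colSums(col_props)  # each column sums to 1
```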

# adding row proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x), margin =  1))
            
                    CC        CT        TT       Sum
  African Am 0.5925926 0.2222222 0.1851852 1.0000000
  Asian      0.3818182 0.3272727 0.2909091 1.0000000
  Caucasian  0.2676660 0.4625268 0.2698073 1.0000000
  Hispanic   0.1739130 0.4347826 0.3913043 1.0000000
  Other      0.3043478 0.4782609 0.2173913 1.0000000
  Sum        1.7203376 1.9250652 1.3545972 5.0000000
# adding column proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x),margin =  2))
            
                     CC         CT         TT        Sum
  African Am 0.09248555 0.02298851 0.03105590 0.14652996
  Asian      0.12138728 0.06896552 0.09937888 0.28973168
  Caucasian  0.72254335 0.82758621 0.78260870 2.33273826
  Hispanic   0.02312139 0.03831418 0.05590062 0.11733618
  Other      0.04046243 0.04214559 0.03105590 0.11366392
  Sum        1.00000000 1.00000000 1.00000000 3.00000000

Chi Squared test of independence

The Chi-squared test is a hypothesis test used to determine whether there is a relationship between two categorical variables.

  • categorical variables can have a nominal or an ordinal measurement scale
  • the observed frequencies are compared with the expected frequencies and their deviations are examined.
# Chi-squared test
# (Test of association to see if 
# H0: the 2 cat var (race  & actn3.r577x ) are independent
# H1: the 2 cat var are correlated in __some way__

tab <- table(famuss$race, famuss$actn3.r577x)
test_chi <- chisq.test(tab)

The obtained result (test_chi) is a list of objects…

You try…

…run View(test_chi) to check

Chi Squared test of independence (cont)

Within test_chi results there are:

  • Observed frequencies = how often a combination occurs in our sample
# Observed frequencies
test_chi$observed
            
              CC  CT  TT
  African Am  16   6   5
  Asian       21  18  16
  Caucasian  125 216 126
  Hispanic     4  10   9
  Other        7  11   5
  • Expected frequencies = what we would expect if the 2 variables were perfectly independent
# Expected frequencies
round(test_chi$expected  , digits = 1 )
            
                CC    CT    TT
  African Am   7.9  11.8   7.3
  Asian       16.0  24.1  14.9
  Caucasian  135.8 204.9 126.4
  Hispanic     6.7  10.1   6.2
  Other        6.7  10.1   6.2
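The expected counts follow the independence formula (row total × column total) / n; a hedged base-R check on a small hypothetical 2×2 table (not the famuss data):

```r
# Expected counts under independence: (row total * column total) / n
# (small hypothetical 2x2 table for illustration)
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)

expected_by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected_by_test <- chisq.test(tab, correct = FALSE)$expected

all.equal(expected_by_hand, expected_by_test, check.attributes = FALSE)  # TRUE
```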

Chi Squared test of independence (results)

  • Recall that
    • \(H_{0}\): the 2 cat. var. are independent
    • \(H_{1}\): the 2 cat. var. are correlated in some way
  • The result of Chi-Square test represents a comparison of the above two tables (observed v. expected):
    • p-value = 0.01286 smaller than α = 0.05 so we REJECT the null hypothesis
test_chi

    Pearson's Chi-squared test

data:  tab
X-squared = 19.4, df = 8, p-value = 0.01286

Computing Cramér’s V after test of independence (by hand)

Recall that Cramér’s V is a way in which we can measure the effect size of the test of independence (i.e. a measure of the strength of association between two nominal variables)

  • V ranges from 0 to 1 (the smaller V, the weaker the association)

\[V=\sqrt{\frac{\chi^2}{n(k-1)}} \]

where:

  • \(V\) denotes Cramér’s V
  • \(\chi^2\) is the Pearson chi-square statistic from the aforementioned test;
  • \(n\) is the sample size involved in the test and
  • \(k\) is the lesser number of categories of either variable

Computing Cramér’s V after test of independence (2 ways)

  • ✍🏻 “By hand” first to see the steps
# Compute Cramér's V
 
# inputs 
chi_calc <- test_chi$statistic
n <- nrow(famuss) # N of obs
n_r <- nrow(test_chi$observed) # number of rows in the contingency table
n_c <- ncol(test_chi$observed) # number of columns in the contingency table

# Cramer’s V
sqrt(chi_calc / (n*min(n_r -1, n_c -1)) )
X-squared 
0.1276816 
# Cramer’s V 
rstatix::cramer_v(test_chi$observed)
[1] 0.1276816

A Cramér’s V of about 0.13 indicates a relatively weak association between the two categorical variables: while there may be some relationship between the variables, it is not particularly strong.

Chi Squared test of goodness of fit

In this case, we are conducting a type of Pearson’s Chi-square test

  • used to test whether the observed distribution of a categorical variable differs from your expectations
  • the statistic is based on the discrepancies between observed and expected counts

Chi Squared test of goodness of fit (example)

Since the participants of the FAMuSS study were volunteers at a university, they did not come from a “representative” sample of the US population

  • The \(\chi^{2}\) test can be used to test the \(H_{0}\) that the participants are racially representative of the general population

Race                   African American   Asian   Caucasian   Other   Total
FAMuSS (Observed)             27            55       467        46     595
US Census (Expected)         76.16          5.95     478.38    34.51   595

We use the formula \(\chi^{2} = \sum_{k}\frac{(Observed - Expected)^{2}}{Expected}\),
under \(H_{0}\): the sample proportions equal the population proportions.

Chi Squared test of goodness of fit (example)

# Subset the vectors of frequencies from the 2 rows  
observed <- c(27,  55,  467, 46)
expected <- c(76.2,  5.95, 478.38,  34.51)

# Calculate Chi-Square statistic manually 
chi_sq_statistic <- sum((observed - expected)^2 / expected) 
df <- length(observed) - 1 
p_value <- 1 - pchisq(chi_sq_statistic, df) 

# Print results 
chi_sq_statistic
[1] 440.2166
df
[1] 3
p_value 
[1] 0

The calculated \(\chi^{2}\) statistic is very large, and the p_value is close to 0. Hence, there is more than sufficient evidence to reject the null hypothesis that the sample is representative of the general population.
A comparison of the observed and expected values (or the residuals) indicates that the largest discrepancy is the over-representation of Asian participants.
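For reference, stats::chisq.test can run the same goodness-of-fit test when the expected distribution is passed as proportions via the p argument; using the counts above, the statistic matches the manual sum exactly (up to how the expected counts are rounded):

```r
# Goodness-of-fit via chisq.test(), passing expected *proportions* through `p`
observed <- c(27, 55, 467, 46)
expected <- c(76.16, 5.95, 478.38, 34.51)  # US Census counts scaled to n = 595

gof <- chisq.test(x = observed, p = expected / sum(expected))

gof$statistic  # same X-squared as sum((observed - expected)^2 / expected)
gof$p.value    # essentially 0
```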

1 CATEGORICAL & 1 NUMERICAL VARIABLES (???)

FROM CORRELATION/ ASSOCIATION TO CAUSATION

Splitting the sample

If we seek a causal relationship between an explanatory and an outcome variable, we should split our sample to have:

  1. one sample to "train" a model on
  2. one sample to "test" the model on
  • Otherwise our model will seem better than it is (since it will be specifically built to “fit” our data)

  • The function rsample::initial_split will assist in that

nhanes_split <- rsample::initial_split(nhanes)
# default = 75%  of observations 
nhanes_train <- rsample::training(nhanes_split)
# default = 25%  of observations 
nhanes_test <- rsample::testing(nhanes_split)
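If rsample is unavailable, an equivalent 75/25 split can be sketched in base R (shown here on the built-in mtcars data as a stand-in for the NHANES sample):

```r
# Base-R sketch of a 75/25 train/test split (stand-in for rsample::initial_split)
set.seed(123)  # make the random split reproducible
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train <- mtcars[train_idx, ]   # 24 of 32 rows
test  <- mtcars[-train_idx, ]  # the remaining 8 rows

nrow(train); nrow(test)
```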

Visualize the data

We are mainly looking for a “vaguely” linear shape here

  • ggplot2 gives us visual confirmation via the linear best fit (the least squares regression line), using method = lm, and its 95% CI band
ggplot(nhanes_train, aes (x = age, 
                          y = bmi)) + 
  geom_point() + 
  geom_smooth(method = lm,  
              #se = FALSE
              )

Visualize the data

Linear regression models

The lm() function, used to fit linear models, has the following generic structure:

lm(y ~ x, data)

where:

  • the 1st argument y ~ x specifies the variables used in the model (here the model regresses a response variable \(y\) against an explanatory variable \(x\))
  • the 2nd argument data names the dataframe containing the variables (needed when it is not already specified in the first argument)
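A self-contained illustration of this structure on the built-in mtcars data (not the lab datasets):

```r
# Fit a simple linear model: mpg regressed on car weight (built-in mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                # intercept and slope for wt
length(fit$residuals)    # one residual per observation (32 rows in mtcars)
```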

Linear regression models syntax

The following example shows fitting a linear model that predicts BMI from age (in years) using data from the nhanes_train adult sample (individuals 21 years of age or older from the NHANES data).

#fitting linear model
lm(nhanes_train$bmi ~ nhanes_train$age)
#equivalently...
lm(bmi ~ age, data = nhanes_train)

Call:
lm(formula = bmi ~ age, data = nhanes_train)

Coefficients:
(Intercept)          age  
  28.788819     0.008794  
  • Running the function creates an object (of class lm) that contains several components (model coefficients, residuals, etc.), either displayed directly or accessible via summary() or extractor functions.

Linear regression models syntax

We can save the model and then extract individual output elements from it using the $ syntax

# name the model object
lr_model <- lm(bmi ~ age, data = nhanes_train)

# extract model output elements
lr_model$coefficients
lr_model$residuals
lr_model$fitted.values

The command summary() returns these elements:

  • Call: recalls the equation used for this regression model
  • Residuals: a 5-number summary of the distribution of residuals from the regression model
  • Coefficients: displays the estimated coefficients of the regression model and the related hypothesis tests, given for:
    • intercept
    • explanatory variable(s) slope

Linear regression models interpretation: coefficients

  • The model tests the null hypothesis that a coefficient is 0
  • Coefficients results display: estimate, std. error, t-statistic, p-value that corresponds to the t-statistic for:
    • intercept
    • explanatory variable(s) slope
  • In regression, the population parameter of interest is typically the slope parameter
    • here the slope for age does not appear significantly different from 0 (p ≈ 0.68)
summary(lr_model)$coefficients 
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 28.788818999 1.11925314 25.7214547 2.145870e-84
age          0.008794038 0.02096193  0.4195242 6.750766e-01

Linear regression models interpretation: Residual Standard error

  • Std. Error of a coefficient = Residual Standard Error (see below) divided by the square root of the sum of squared deviations of that particular x variable
  • t-statistic = Estimate divided by Std. Error
  • p-value for the t-statistic

Recall that the Residual Standard Error is (roughly) the average distance that the observed values fall from the regression line, with a degrees-of-freedom “twist”

# Residual Standard Error (like a standard deviation of the residuals)

# ---  inputs 
k = length(lr_model$coefficients) - 1 # subtract one to ignore the intercept
n = length(lr_model$residuals)
df = n - (1 + k)
# Sum of Squared Errors
SSE = sum(lr_model$residuals**2)

# --- Residual Standard Error
sqrt(SSE/df)
[1] 6.863851
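The same by-hand computation can be cross-checked against summary()$sigma, which reports exactly this quantity; a sketch on the built-in mtcars data (not the NHANES sample):

```r
# Residual Standard Error by hand vs. summary()$sigma (built-in mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)

k   <- length(fit$coefficients) - 1   # predictors, ignoring the intercept
n   <- length(fit$residuals)
SSE <- sum(fit$residuals^2)           # sum of squared errors

rse_by_hand <- sqrt(SSE / (n - (1 + k)))
all.equal(rse_by_hand, summary(fit)$sigma)  # TRUE
```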


https://www.youtube.com/watch?v=ebHLMyqC2UY

Linear regression models outputs: \(R^2\) and \(Adj. R^2\)

\(R^2\) is the proportion of variance in the response explained by the model; \(Adj. R^2\) penalizes it for the number of predictors. Here both are essentially 0: age alone explains almost none of the variation in BMI in this sample.

summary(lr_model)$r.squared
[1] 0.000475451
summary(lr_model)$adj.r.squared
[1] -0.002225967
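\(R^2\) can also be reproduced by hand as 1 − SSE/SST (unexplained over total variation); a sketch on the built-in mtcars data (not the NHANES sample):

```r
# R-squared by hand: 1 - SSE/SST (built-in mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)

SSE <- sum(fit$residuals^2)                     # unexplained variation
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation

r2_by_hand <- 1 - SSE / SST
all.equal(r2_by_hand, summary(fit)$r.squared)   # TRUE
```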

Linear regression models outputs: fitted values

# display only the first few fitted values (one per training observation)
head(lr_model$fitted.values)
       1        2        3        4        5        6 
29.33405 29.08782 29.02626 29.29887 29.41320 29.28129 
28.98229 29.31646 28.97349 29.24611 29.46596 29.35164 29.13179 28.97349 
     202      203      204      205      206      207      208      209 
29.30767 28.97349 29.24611 29.26370 28.99988 29.25490 29.26370 29.04385 
     210      211      212      213      214      215      216      217 
29.49234 29.21093 29.29887 29.03505 29.07023 29.25490 29.13179 29.19334 
     218      219      220      221      222      223      225      226 
29.38681 29.41320 29.07902 29.20214 29.07023 29.15817 29.24611 29.15817 
     227      228      229      230      231      232      233      234 
29.13179 29.13179 29.03505 29.26370 29.04385 29.21093 29.25490 29.43958 
     235      236      237      238      239      240      241      242 
29.48355 29.13179 29.38681 28.97349 29.28129 29.00867 29.42199 29.07023 
     243      244      245      246      247      248      249      250 
28.99108 29.20214 29.07023 29.29887 29.17576 29.34284 29.18455 29.24611 
     251      252      253      254      255      256      257      258 
29.27249 29.42199 29.07023 29.14058 28.99108 29.40440 29.49234 29.01746 
     259      260      261      262      263      264      265      266 
29.11420 29.49234 29.29008 29.06143 29.33405 29.14058 29.36923 29.24611 
     267      268      269      270      271      272      273      274 
29.10540 29.38681 29.11420 29.14937 29.13179 29.11420 29.49234 29.25490 
     275      276      277      278      279      280      281      282 
29.41320 29.27249 29.19334 29.29887 28.98229 29.12299 29.20214 29.40440 
     283      284      285      286      287      288      289      290 
29.40440 29.06143 29.17576 29.40440 29.45717 29.09661 29.06143 29.01746 
     291      292      293      294      295      296      297      298 
29.36043 29.31646 29.28129 29.18455 29.36043 29.03505 29.02626 29.45717 
     299      300      301      302      303      304      305      306 
29.22852 29.23731 28.99988 29.26370 29.21093 29.06143 29.41320 29.35164 
     307      308      309      310      311      312      313      314 
29.14058 29.43958 29.03505 29.43958 29.22852 29.42199 29.05264 29.19334 
     315      316      317      318      319      320      321      322 
29.45717 29.43078 29.26370 29.46596 29.12299 28.97349 29.14937 29.42199 
     323      324      325      326      327      328      329      330 
29.29887 29.25490 29.33405 28.99108 29.38681 29.46596 29.35164 29.34284 
     331      332      333      334      335      336      337      338 
29.43078 29.49234 29.22852 29.24611 29.36923 29.07902 29.49234 29.32526 
     339      340      341      342      343      344      345      346 
29.21973 29.44837 29.29887 29.21093 29.36923 29.11420 28.99988 29.44837 
     347      348      349      350      351      352      353      354 
29.36043 29.12299 29.49234 29.03505 28.99108 29.35164 29.26370 29.28129 
     355      356      357      358      359      360      362      363 
29.21093 29.08782 29.36043 29.00867 28.98229 29.00867 29.47475 29.34284 
     364      365      366      367      368      369      370      371 
29.37802 29.22852 29.49234 29.17576 29.07902 29.10540 29.19334 29.13179 
     372      373      374      375 
29.20214 29.14058 29.32526 29.21973 

Linear regression models outputs: residuals

lr_model$residuals 
            1             2             3             4             5 
 -3.574049384  -0.387816307  -4.466258037  -3.198873230   5.716804270 
            6             7             8             9            10 
  2.928714847   3.798213501  -3.560932845  -8.980932845   0.184744655 
           11            12            13            14            15 
 -1.625255345  -1.978168615   1.142834078   2.648565809  -3.843344768 
           16            17            18            19            20 
 -5.561285153  -0.022992461   2.088362539  -3.945052076   4.945096962 
           21            22            23            24            25 
-11.092342076  12.235096962  -7.079726884   9.384744655  -1.981285153 
           26            27            28            29            30 
 -1.910431499   4.288213501  -6.573493807   0.629771770  -7.997314961 
           31            32            33            34            35 
 -4.781285153  -3.740580538  -6.376962653  17.783538693  -6.370932845 
           36            37            38            39            40 
 -3.775607653   9.289067155  -2.917463999   4.266506193   3.233186385 
           41            42            43            44            45 
 -5.387816307  -8.255756691  12.891980424  -5.033846114  -4.769374576 
           46            47            48            49            50 
  9.111126770  -7.228520922  -0.152342076  -0.766962653  10.985801578 
           51            52            53            54            55 
 -6.090079192  -3.116461307   4.667712155   9.285950616  -5.574198422 
           56            57            58            59            60 
 -9.741285153  -3.690079192  -5.648520922  -1.984049384  15.373741963 
           61            62            63            65            66 
 -9.540580538   4.230774462  -9.304198422   0.197861193  -3.778019576 
           67            68            69            70            71 
 -3.301434191  -4.181285153   5.127861193  -3.494550730  -1.630783807 
           72            73            74            75            76 
 -1.875756691   5.656153886  -2.657165922   4.673741963   0.007657924 
           77            78            79            80            81 
 -1.360431499  -0.273493807   5.206655232  -0.092342076  -2.136610345 
           82            83            84            85            86 
 -0.402138807   1.897861193   4.845096962  -6.290079192  -3.190079192 
           87            88            89            90            91 
-10.674550730   0.043389655   0.478010231  24.883538693   8.517007539 
           92            93            94            95            96 
 -2.923846114  10.837861193  -0.160079192  -2.798873230   7.112183693 
           97            98            99           100           101 
 -4.692342076   7.824243309   2.307657924 -10.535756691   1.948213501 
          102           103           104           105           106 
 -4.898019576   7.771479078  -5.112342076   5.027657924   6.329771770 
          107           108           109           110           111 
  8.127712155  -3.486813615   3.482536001   9.995598308   3.017712155 
          112           113           114           115           116 
  3.091330039  -3.508669961  -6.649374576  -7.230783807   4.536302924 
          117           118           119           120           121 
 -3.884753999  -2.110932845  -1.305404384   6.287359847  -7.790079192 
          122           123           124           125           126 
 -8.240580538   6.127508885   3.883538693  -0.319726884  -5.204401692 
          127           128           129           130           131 
 -8.016108999  -2.951637461  -4.038371884  -2.954903038  34.795801578 
          132           133           134           135           136 
 -3.546108999   4.662332732   0.286804270   7.433538693  -7.194550730 
          137           138           139           140           141 
  4.845598308   7.553891001  10.413891001   4.387657924  24.575096962 
          142           143           144           145           146 
 -8.047165922  -6.182491115  -3.482287845  17.708918116  -6.892138807 
          147           148           149           150           151 
  7.361479078  -0.496461307  -8.049225538  -1.892342076  -7.808669961 
          152           153           154           155           156 
  0.756153886  -6.905404384  -1.399374576  -1.693344768   6.968213501 
          157           158           159           160           161 
  0.255950616   2.482834078  -6.823548038   6.695598308  -5.892342076 
          162           163           164           165           166 
  5.043741963   1.207657924  12.538565809   1.897861193  10.578010231 
          167           168           169           170           171 
  3.653389655   0.311980424  -9.037314961   4.168213501  -8.703846114 
          172           173           174           175           176 
 -4.230580538   1.407657924  -5.226258037  -0.161989769   4.199920808 
          177           178           179           180           181 
 -1.642843422  -4.399875922  -7.758168615  -8.582640153   4.365950616 
          182           183           184           185           186 
  9.357861193  -7.228520922  -4.435052076   0.953891001  -2.969225538 
          187           188           189           190           191 
  0.749419462   8.615449270  -3.272491115  -3.023548038  -6.436461307 
          192           193           194           195           196 
 -6.546962653   2.086804270  -6.182287845   0.053538693  -6.113493807 
          197           198           199           200           201 
  9.593891001  -0.915959961   9.568362539  -2.421786499  -5.873493807 
          202           203           204           205           206 
 -7.207667268  -0.673493807  -4.546108999  -8.233697076  -4.699875922 
          207           208           209           210           211 
 -8.324903038  10.436302924  -3.943846114  -0.122342076   5.989067155 
          212           213           214           215           216 
  2.001126770  -6.735052076  -8.770228230   3.785096962 -10.521786499 
          217           218           219           220           221 
  3.506655232  21.943186385   0.666804270  -1.479022268   7.597861193 
          222           223           225           226           227 
 -4.640228230   3.341831385   4.553891001   3.341831385  -0.631786499 
          228           229           230           231           232 
 -2.991786499   4.364947924  -6.353697076   3.556153886  -1.410932845 
          233           234           235           236           237 
 -2.954903038  -6.609577845  -6.293548038   3.518213501  -4.186813615 
          238           239           240           241           242 
 -7.283493807   1.608714847  -6.808669961  -2.741989769  -5.770228230 
          243           244           245           246           247 
 -6.991081884  -7.662138807  -0.270228230 -11.108873230  11.504243309 
          248           249           250           251           252 
  3.857156578  -1.184550730  10.413891001  -2.272491115  -2.741989769 
          253           254           255           256           257 
 -6.300228230   8.649419462  -4.991081884  -3.624401692   0.537657924 
          258           259           260           261           262 
  2.582536001   4.795801578   0.307657924   5.309920808  -4.971434191 
          263           264           265           266           267 
 -8.574049384  -0.150580538   3.960774462  -3.746108999   1.394595616 
          268           269           270           271           272 
  3.213186385  -7.154198422  -6.859374576 -10.331786499   0.785801578 
          273           274           275           276           277 
 -0.862342076  -2.954903038  -2.873195730  -6.252491115   7.946655232 
          278           279           280           281           282 
 -1.438873230 -10.982287845  18.867007539   3.337861193  11.665598308 
          283           284           285           286           287 
  7.195598308  -0.091434191   7.204243309  -0.804401692   0.322834078 
          288           289           290           291           292 
 -8.896610345  -9.061434191   6.222536001  -3.760431499  -4.226461307 
          293           294           295           296           297 
 -2.441285153   4.215449270  -2.920431499   0.364947924  -2.426258037 
          298           299           300           301           302 
 -9.187165922   2.581479078  -4.237314961  10.400124078  18.436302924 
          303           304           305           306           307 
  8.789067155  -1.461434191  14.726804270  -5.151637461   4.759419462 
          308           309           310           311           312 
  0.260422155   6.944947924  -4.239577845  -3.828520922  -0.121989769 
          313           314           315           316           317 
 -1.552640153  -3.343344768   0.642834078  -3.260783807  -1.063697076 
          318           319           320           321           322 
  6.034040039   2.577007539  -8.273493807  -0.949374576  -3.691989769 
          323           324           325           326           327 
  5.001126770   3.045096962   4.365950616   3.558918116  -2.186813615 
          328           329           330           331           332 
  2.434040039  -4.421637461  -5.642843422   2.859216193 -10.092342076 
          333           334           335           336           337 
  1.011479078  -1.846108999   5.920774462  -1.279022268  -0.122342076 
          338           339           340           341           342 
  2.374744655   3.170273116  -1.248371884  -4.788873230  -5.610932845 
          343           344           345           346           347 
 -4.699225538  -1.614198422 -11.199875922  -1.848371884  -4.940431499 
          348           349           350           351           352 
  7.857007539  -4.892342076  -6.665052076  -7.711081884  -2.751637461 
          353           354           355           356           357 
 -8.563697076   4.818714847  -3.640932845  -3.727816307   2.749568501 
          358           359           360           362           363 
 -6.908669961  -1.722287845  11.801330039  -4.874753999  -3.072843422 
          364           365           366           367           368 
  2.971980424  -1.328520922  -4.742342076   0.024243309  39.550977732 
          369           370           371           372           373 
  2.794595616   0.306655232   5.098213501   1.697861193   3.859419462 
          374           375 
 11.974744655  10.860273116 

DIAGNOSTIC PLOTS

The following plots help us assess whether (some of) the assumptions of linear regression are met!

(The independence assumption is tied more to the study design than to the data used in modeling.)

ASSUMPTION 1: there exists a linear relationship between the independent variable, x, and the dependent variable, y

For an observation \((x_i, y_i)\), where \(\hat{y}_i\) is the predicted value according to the line \(\hat{y} = b_0 + b_1x\), the residual is the value \(e_i = y_i - \hat{y}_i\)

  • A linear model is a particularly good fit for the data when the residual plot shows random scatter above and below the horizontal line.
    • (In this R plot, we look for a red line that is fairly straight)
# residual plot
plot(lr_model, which = 1 )
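The residuals plotted here are simply the observed values minus the fitted values, as in the formula above. A quick sketch confirming this, using the built-in `mtcars` data as a stand-in for the NHANES sample:

```r
# Sketch: residuals e_i = y_i - y_hat_i, computed by hand (toy model)
toy_model <- lm(mpg ~ wt, data = mtcars)
e <- mtcars$mpg - toy_model$fitted.values
all.equal(unname(e), unname(toy_model$residuals))   # TRUE
```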

DIAGNOSTIC PLOTS

Linear regression diagnostic plots: normality of residuals 2/4

ASSUMPTION 2: The residuals of the model are normally distributed

With the quantile-quantile (Q-Q) plot we can check the normality of the residuals.

# quantile-quantile plot
plot(lr_model, which = 2 )

Linear regression diagnostic plots: normality of residuals 2/4

The data appear roughly normal, but there are deviations from normality in the tails, particularly the upper tail.
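Besides the visual Q-Q check, normality of the residuals can also be tested formally with the Shapiro-Wilk test from the stats package. A sketch, using the built-in `mtcars` data as a stand-in:

```r
# Sketch: formal normality test of the residuals (Shapiro-Wilk)
# H0: residuals are normally distributed; a small p-value rejects normality
toy_model <- lm(mpg ~ wt, data = mtcars)
shapiro.test(residuals(toy_model))
```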

Linear regression diagnostic plots: Homoscedasticity 3/4

ASSUMPTION 3: The residuals have constant variance at every level of x (“homoscedasticity”)

This one is called a Scale-Location (or Spread-Location) plot: it shows whether the residuals are spread equally along the range of the predictor

# Spread-location plot
plot(lr_model, which = 3 )

Linear regression diagnostic plots: Homoscedasticity 3/4

Test for Homoscedasticity

Besides the visual check, we can perform the Breusch-Pagan test to verify the assumption of homoscedasticity. In this case:

  • \(H_{0}\): residuals are distributed with equal variance

  • \(H_{1}\): residuals are distributed with UNequal variance

  • we use the bptest function from the lmtest package

# Breusch-Pagan test against heteroskedasticity 
lmtest::bptest(lr_model)

    studentized Breusch-Pagan test

data:  lr_model
BP = 2.6015, df = 1, p-value = 0.1068


Because the test statistic (BP) is small and the p-value is not significant (p-value > 0.05): WE DO NOT REJECT THE NULL HYPOTHESIS

Linear regression diagnostic plots: leverage 4/4

This last diagnostic plot does not refer to any of the assumptions, but has to do with outliers:

  • a residuals vs. leverage plot allows us to identify influential observations in a regression model
    • The x-axis shows the “leverage” of each point and the y-axis shows the standardized residual of each point. Leverage measures how much the coefficients in the regression model would change if that particular observation were removed from the dataset
    • Cook’s distance lines (red dashed lines) – not visible here – appear on the corners of the plot when there are influential cases
plot(lr_model, which = 5 )

Linear regression diagnostic plots: leverage 4/4

In this particular case, there are no influential cases
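Cook's distances themselves can be extracted with stats::cooks.distance; a common rule of thumb flags observations with distance above 4/n as worth inspecting. A sketch, using the built-in `mtcars` data as a stand-in:

```r
# Sketch: list potentially influential observations via Cook's distance
toy_model <- lm(mpg ~ wt, data = mtcars)
d <- cooks.distance(toy_model)
which(d > 4 / length(d))   # indices of points above the 4/n rule of thumb
```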

(Small digression on the broom package)

  • The broom package introduces the tidy approach to regression modeling code and outputs, allowing us to convert/save them in the form of tibbles
  • the function augment returns the model results (fitted values, residuals, etc.) attached to each observation
    • this is very useful for further use of such objects, e.g. with ggplot2
# render model as a dataframe 
broom::tidy(lr_model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 28.8        1.12      25.7   2.15e-84
2 age          0.00879    0.0210     0.420 6.75e- 1
# see overal performance 
broom::glance(lr_model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.000475      -0.00223  6.86     0.176   0.675     1 -1243. 2493. 2505.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# save an object with all the model output elements 
model_aug <- broom::augment(lr_model)

You try…

…run View(model_aug) to check
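For instance, the residuals-vs-fitted diagnostic can be rebuilt from the augmented data with ggplot2. A sketch on the built-in `mtcars` data (the .fitted and .resid columns are created by augment):

```r
library(broom)   # augment()
library(ggplot2)

# Sketch: rebuild the residuals-vs-fitted plot from broom::augment() output
toy_aug <- broom::augment(lm(mpg ~ wt, data = mtcars))

p <- ggplot(toy_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
# print(p) to display the plot
```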

Linear regression performance

We can start looking at how the model performs by applying it to our nhanes_test sub-sample, using the function predict

# recall the dataset size 
nrow(nhanes_test) # 125 
[1] 125
# obtain 125 predicted observations 
predict(lr_model, nhanes_test)
       1        2        3        4        5        6        7        8 
29.03505 29.22852 29.43958 29.01746 29.24611 29.35164 29.43958 29.01746 
       9       10       11       12       13       14       15       16 
29.32526 29.10540 29.35164 29.28129 29.35164 29.18455 29.10540 29.19334 
      17       18       19       20       21       22       23       24 
29.23731 29.29008 29.43078 29.43078 29.35164 29.17576 29.17576 29.23731 
      25       26       27       28       29       30       31       32 
29.21093 29.26370 29.16696 29.36043 29.45717 29.04385 29.42199 29.08782 
      33       34       35       36       37       38       39       40 
29.13179 29.31646 29.12299 29.18455 29.28129 28.99108 29.27249 29.30767 
      41       42       43       44       45       46       47       48 
29.24611 29.41320 28.99108 29.19334 29.28129 29.01746 29.33405 29.30767 
      49       50       51       52       53       54       55       56 
29.04385 29.35164 29.14058 29.39561 28.99988 29.16696 29.10540 29.01746 
      57       58       59       60       61       62       63       64 
29.16696 29.01746 29.49234 29.24611 29.45717 29.07902 29.14937 29.41320 
      65       66       67       68       69       70       71       72 
29.17576 29.49234 28.97349 29.21093 29.07902 29.07902 29.43958 29.20214 
      73       74       75       76       77       78       79       80 
29.21973 29.21973 29.14058 29.11420 29.13179 29.08782 29.21093 29.25490 
      81       82       83       84       85       86       87       88 
29.16696 29.16696 29.28129 29.22852 29.18455 29.24611 29.29887 29.29008 
      89       90       91       92       93       94       95       96 
29.43958 29.01746 29.29008 29.49234 29.05264 29.29887 29.08782 28.99988 
      97       98       99      100      101      102      103      104 
29.18455 28.99988 29.38681 29.03505 29.16696 29.49234 29.17576 29.29008 
     105      106      107      108      109      110      111      112 
29.13179 29.07023 28.97349 29.43958 29.36043 29.26370 29.21093 29.06143 
     113      114      115      116      117      118      119      120 
29.05264 29.04385 29.17576 29.27249 29.19334 29.29887 29.36923 28.99108 
     121      122      123      124      125 
29.22852 28.98229 29.05264 29.15817 29.12299 

Linear regression performance: predicted values in test sample

  • We can look at the 95% CI of the predicted values
# obtain 125 predicted observations + 95% CI 
pred_val_test <- predict(lr_model, nhanes_test, interval = "confidence")
# print out first 6 predicted values and CI boundaries 
head(pred_val_test)
       fit      lwr      upr
1 29.03505 27.86915 30.20095
2 29.22852 28.52826 29.92878
3 29.43958 28.24872 30.63044
4 29.01746 27.78463 30.25030
5 29.24611 28.54402 29.94819
6 29.35164 28.46073 30.24254
  • We can look at the 95% prediction interval of a single predicted value
# obtain 125 predicted observations + 95% PI 
pred_val_test <- predict(lr_model, nhanes_test, interval = "prediction")
# print out first 6 predicted values and CI boundaries 
head(pred_val_test)
       fit      lwr      upr
1 29.03505 15.48774 42.58237
2 29.22852 15.71332 42.74373
3 29.43958 15.89009 42.98906
4 29.01746 15.46423 42.57070
5 29.24611 15.73081 42.76141
6 29.35164 15.82522 42.87806
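Note the difference between the two intervals: interval = "confidence" bounds the mean response at a given x, while interval = "prediction" bounds a single new observation, so it is always wider. A sketch, using the built-in `mtcars` data as a stand-in:

```r
# Sketch: prediction intervals are wider than confidence intervals, since
# they add the residual scatter of individual observations to the
# uncertainty about the regression line (toy model on mtcars)
toy_model <- lm(mpg ~ wt, data = mtcars)
newdata   <- data.frame(wt = c(2.5, 3.5))
conf_int  <- predict(toy_model, newdata, interval = "confidence")
pred_int  <- predict(toy_model, newdata, interval = "prediction")
# width of each interval: TRUE for every row
(pred_int[, "upr"] - pred_int[, "lwr"]) > (conf_int[, "upr"] - conf_int[, "lwr"])
```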

Linear regression performance: RMSE

Basically we are asking: “how do the predictions compare to the actual test dataset?”

For this we take the root of the mean squared difference between the predicted and the actual values:

RMSE = Root Mean Squared Error: \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)

# --- testing sample 
RMSE <- sqrt(mean(
  (nhanes_test$bmi - predict(lr_model, nhanes_test))^2,
  na.rm = T))

RMSE # 6.687158
[1] 6.687158

This is quite close to the Residual Standard Error that we got from the regression model summary (6.864), even though that was computed on the training data while this comes from the testing data
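For comparison, the RMSE on the training sample comes straight from the residuals; it differs from the residual standard error only in the divisor (n instead of n - (k + 1)), so it is always slightly smaller. A sketch, using the built-in `mtcars` data as a stand-in:

```r
# Sketch: training RMSE vs residual standard error (toy model on mtcars)
toy_model  <- lm(mpg ~ wt, data = mtcars)
rmse_train <- sqrt(mean(residuals(toy_model)^2))  # divides by n
rse        <- summary(toy_model)$sigma            # divides by n - (k + 1)
rmse_train < rse   # TRUE
```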

Linear regression performance: \(R^2\)

#Multiple R-Squared (Coefficient of Determination) on the test sample
SSyy = sum((nhanes_test$bmi - mean(nhanes_test$bmi, na.rm = T))**2, na.rm = T)
# NOTE: the SSE must also come from the TEST sample; using the training
# residuals here would make the two sums incomparable
SSE = sum((nhanes_test$bmi - predict(lr_model, nhanes_test))**2, na.rm = T)
(SSyy-SSE)/SSyy
#Alternatively
1-SSE/SSyy
# Both give a value near zero, since age explains almost none of the variance
# in bmi; a negative value means the model predicts worse than the test mean

Linear regression models outputs: model fit

str(summary(lr_model)) 
List of 12
 $ call         : language lm(formula = bmi ~ age, data = nhanes_train)
 $ terms        :Classes 'terms', 'formula'  language bmi ~ age
  .. ..- attr(*, "variables")= language list(bmi, age)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "bmi" "age"
  .. .. .. ..$ : chr "age"
  .. ..- attr(*, "term.labels")= chr "age"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(bmi, age)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "bmi" "age"
 $ residuals    : Named num [1:372] -3.574 -0.388 -4.466 -3.199 5.717 ...
  ..- attr(*, "names")= chr [1:372] "1" "2" "3" "4" ...
 $ coefficients : num [1:2, 1:4] 28.78882 0.00879 1.11925 0.02096 25.72145 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "age"
 $ sigma        : num 6.86
 $ df           : int [1:3] 2 370 2
 $ r.squared    : num 0.000475
 $ adj.r.squared: num -0.00223
 $ fstatistic   : Named num [1:3] 0.176 1 370
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 2.66e-02 -4.72e-04 -4.72e-04 9.33e-06
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..$ : chr [1:2] "(Intercept)" "age"
 $ na.action    : 'omit' Named int [1:3] 64 224 361
  ..- attr(*, "names")= chr [1:3] "64" "224" "361"
 - attr(*, "class")= chr "summary.lm"
names(summary(lr_model))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    
summary(lr_model)$sigma
[1] 6.863851
summary(lr_model)$df
[1]   2 370   2
summary(lr_model)$r.squared
[1] 0.000475451
summary(lr_model)$adj.r.squared
[1] -0.002225967
summary(lr_model)$fstatistic
      value       numdf       dendf 
  0.1760005   1.0000000 370.0000000 
summary(lr_model)$cov.unscaled  ##lm 
              (Intercept)           age
(Intercept)  0.0265901585 -4.721505e-04
age         -0.0004721505  9.326678e-06
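Note that the overall p-value of the F-test is not stored in summary() directly; it can be recovered from the fstatistic element with pf(). A sketch, using the built-in `mtcars` data as a stand-in:

```r
# Sketch: recover the overall F-test p-value from summary()$fstatistic
toy_model <- lm(mpg ~ wt, data = mtcars)
f <- summary(toy_model)$fstatistic            # value, numdf, dendf
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(p_overall)  # matches the "p-value" line printed by summary()
```

For a single-predictor model this equals the p-value of the slope's t-test, since F = t².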